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Abstract — For hard computational problems, stochastic local search 
has proven to be a competitive approach to finding optimal or approx- 
imately optimal problem solutions. Two key research questions for 
stochastic local search algorithms are: Which algorithms are effective 
for initialization? When should the search process be restarted? In 
the present work we investigate these research questions in the context 
of approximate computation of most probable explanations (MPEs) in 
Bayesian networks (BNs). We introduce a novel approach, based on 
the Viterbi algorithm, to explanation initialization in BNs. While the 
Viterbi algorithm works on sequences and trees, our approach works 
on BNs with arbitrary topologies. We also give a novel formalization of 
stochastic local search, with focus on initialization and restart, using 
probability theory and mixture models. Experimentally, we apply our 
methods to the problem of MPE computation, using a stochastic local 
search algorithm known as Stochastic Greedy Search. By carefully 
optimizing both initialization and restart, we reduce the MPE search 
time for application BNs by several orders of magnitude compared to 
using uniform at random initialization without restart. On several BNs 
from applications, the performance of Stochastic Greedy Search is 
competitive with clique tree clustering, a state-of-the-art exact algorithm 
used for MPE computation in BNs. 

Index Terms — Stochastic local search, Bayesian networks, initializa- 
tion, restart, finite mixture models. 

1 Introduction 

M ULTI- VARIATE probability distributions play a cen- 
tral role in a wide range of automated reasoning 
and state estimation applications. Multi-variate prob- 
ability distributions can be decomposed by means of 
Bayesian networks [1], factor graphs [2], Tanner graphs, 
Markov random fields [3], [4], arithmetic circuits [5], or 
clique trees [6], [7], [8]. If the resulting graph decomposi- 
tion is relatively sparse, efficient reasoning and learning 
algorithms exist. 
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In this article, we focus on reasoning in the form 
of stochastic local search in Bayesian networks (BNs). 
Stochastic local search (SLS) algorithms are among the 
best known for computationally hard problems includ- 
ing satisfiability (SAT) [9], [10], [11], [12]. SLS algo- 
rithms have also been successful in computing the most 
probable explanation [13], [14], [15], [16], [17], [18] and 
the maximum a posteriori hypothesis [19] in Bayesian 
networks. While the details of different SLS algorithms 
vary [12], by definition they use noise and initialization 
algorithms in addition to hill-climbing; SLS algorithms 
typically also rely on restarts. 

Our focus in this work is on initialization and restart. 
Specifically, we investigate the following research ques- 
tions. How do different SLS initialization algorithms 
impact performance? How can their effect be analyzed? 
What is the impact of the restart parameter? In answer- 
ing these questions, we consider the stochastic greedy 
search (SGS) algorithm, an SLS approach for computing 
MPEs in BNs [14], [15]. The stochastic local search part of 
SGS is a generalization of the GSAT and WALKS AT fam- 
ily of algorithms [9], [20], [10] to the probabilistic setting 
of BNs, and SGS is also related to other SLS algorithms 
for BN computation [21], [13], [19]. Specifically, our con- 
tribution in this work is two-fold. Our first contribution 
consists of two novel initialization algorithms. These 
algorithms are generalizations of the Viterbi approach 
[22] and use Viterbi to exactly compute MPEs for tree- 
structured BNs. Algorithms that exploit tree-structured 
graphs or sub-graphs are well-known [2], [3], [4]; in 
this article we discuss how such algorithms can be 
used for SLS initialization in general BNs. Informally, 
these algorithms construct explanations that are good 
approximations to MPEs, and these explanations make 
up starting points for the hill-climbing phase of SGS. In 
application BNs, we show experimentally that these and 
other initialization algorithms can substantially improve 
performance. Our second contribution rests on the 
belief that the research questions raised above are best 
answered within a solid mathematical framework. Con- 
sequently, we carefully develop a mathematical frame- 
work for SLS analysis, based on probability theory and 
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mixture distributions, and fit mixture models to SLS run- 
length data in experiments. We thus help improve the 
theoretical foundations of SLS; this is significant because 
the theoretical foundations of SLS have lagged compared 
to the impressive experimental performance of these 
algorithms [23], [24], [17]. 

The most closely related research to ours has focused 
on run time variability, SLS restart, and SLS initialization. 
An easy-hard-easy run time pattern has been observed 
for the NP-complete satisfiability (SAT) problem [25], 
and great variability in run times has been observed in 
the hard region of SAT [26]. Even in the easy region, 
there is great variability in the run time for an ensemble 
of problem instances [27]. Based on these observations, 
the benefit of randomized restarts has been established, 
both for SLS [28] and for systematic search [29]. Mo- 
tivated by high run time variability for hard problem 
instances, Bayesian models that predict inference run 
times have been learned and then used to dynamically 
optimize the SLS restart point [30]. In other related 
research, off-line and on-line computations are combined 
to dynamically control restarts [31], and restart policies 
based on beliefs about problem instance hardness are 
used to make search more efficient [32]. Finally, we 
note that initialization has turned out to be crucial in 
making SLS competitive with other approaches to BN 
computation, especially in application BNs [14], [15], 
[19], [13]. 

The rest of this article is organized as follows. Section 2 
contains notation and fundamental concepts. In Section 
3 we introduce our SLS approach including our novel 
initialization algorithms. In Section 4, we analyze our 
approach using techniques from probability theory and 
finite mixtures. Experimental results are presented in 
Section 5. In Section 6 we conclude and present direc- 
tions for future research. We note that earlier versions of 
this research have been reported previously [14], [15]. 

2 Preliminaries 

This section introduces notation and known results re- 
lated to Bayesian networks, the Viterbi approach, and 
tree-based reparametrization. 

2.1 Fundamental Concepts and Notation 

Bayesian networks [1], defined as follows, organize ran- 
dom variables in directed acyclic graphs (DAGs). 

Definition 1 (Bayesian netzvork): A Bayesian network 
(BN) is a tuple /3 = (X, E, P), where (X, E) is a 
DAG with n = |X| nodes, m = \E\ edges, and an as- 
sociated set of conditional probability distributions P = 
{Pr(Xx | H Xl ), - Pr(X„ | HO}. Here, Pr(X ( | U Xi ) 
is the conditional probability distribution for Ij £ I. 
Further, let tv x, represent the instantiation of the parents 
ll X/ of X; . The independence assumptions encoded in 
(X, E) imply the joint probability distribution 

n 

Pr(a:) = n Pr(a;, ; | n Xi ), (1) 

t:=i 


where Pr(a:) = Pr(Xi = x\, . . ., X n = x n ). 

Pr(X,; | 1 1 x, ) is also known as a conditional probability 
table (CPT). The notation tlx is used to represent the 
(here discrete) state space of a BN node X. A BN may be 
given evidence by clamping some nodes to their observed 
states. An instantiation of the remaining nodes is an 
explanation, formally defined as follows. 

Definition 2 (Explanation): Consider a BN /3 = (X, 
E, P) with evidence e = {xi,...,x m } = {X\ = 
Xi,...,X m = a;,,, } . An explanation x is defined as 
x , • . • , x n } {X m _j_i X n 3-n}* 

A sub-explanation y of x is defined as y C x. 

To simplify the exposition, one may regard z = x U e, 
and consider Pr(z) = Pr(a:, e) = Pr(a: | e)Pr(e) instead 
of the closely related Pr(ai | e). The BN 6 is typically left 
implicit when discussing an explanation x for [3. 

Given a BN with evidence or no evidence, various 
forms of BN inference can be performed [1], [6], [7], [33], 
[8], [19]. This article focuses on computing the most 
probable explanation, also known as belief revision [1]. 

Definition 3 (Most probable explanation (MPE)): 
Computing a most probable explanation (MPE) in 
a BN is the problem of finding an explanation x* 
such that Pr(£c*) > Pr (y), where y is any other 

explanation in the BN. The set of the k most probable 
explanations is defined as X* = { x(, .... x). } where 
Pr ( x *) = Pr (x\) = • • • = Pr (x* k ). 

In other words, no other explanation has higher prob- 
ability than x* for 1 < i < k. Several explanations with 
the same probability can exist, and we therefore say "an" 
MPE rather than "the" MPE. 

It can be shown by reduction from SAT that MPE 
computation is NP-hard [34], Approximating an MPE to 
within a constant ratio-bound has also been proven to be 
NP-hard [35]. Since inference in BNs is computationally 
hard and the MPE problem is important in applications, 
it is important to study inexact approaches, including 
stochastic local search (SLS) algorithms, where estimates 
of x* G X* are computed. 

Definition 4 (MPE (lower-bound) estimate): Let x* be an 
MPE. A best-effort estimate of x* is denoted x ; if 
Pr(a:*) < Pr(a:*) then x* is a lower-bound estimate. 

SLS algorithms typically compute lower-bound MPE 
estimates x*. In Section 3.1 we discuss one SLS algo- 
rithm, Stochastic Greedy Search, in more detail. 

2.2 Existing Dynamic Programming Algorithms 

We now discuss forward and backward dynamic pro- 
gramming for chains and trees; we denote these algo- 
rithms TreeFDP and TreeBDP respectively. We present 
this well-known approach due to Viterbi [36], [22] for 
BNs that are chains in some detail. The case of trees is a 
straight-forward extension since different paths down a 
tree are independent and can be treated independently 
using essentially the same algorithm. We introduce the 
following terminology for BNs with such tree topologies. 

Definition 5 (Backivard tree, forward tree): Consider a 
BN f3 = (X, E, P). If the underlying graph (X, E) of 
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/3 is a tree where all non-leaf nodes have one or more 
children, then (3 is a backward tree. If the underlying 
graph (X, E) of (3 is a tree where all non-root nodes 
have one or more parents, then (3 is a forward tree. 

Y — > X <— Z is an example forward tree and Y <— 
X — > Z is an example backward tree; see Figure 2 for 
other backward tree examples. 

To simplify exposition, we now assume BN nodes 
with S = 2 states, say {0, 1}. The approach clearly 
generalizes to S > 2. We consider a BN that is a chain 
Xi — > X 2 X T of length T. The CPTs are 

denoted by Pr(Xi = xf), Pr(X 2 = x 2 \ X- t = x\), . . ., 
Pr(X t = xt | X t _i = xt- i), where x,; G {0,1}- The key 
observation that TreeFDP is based on is the following. 
In order to compute the probability of the most probable 
explanation of a sub-chain of length t < T, it is sufficient 
to know 5 = 2 numbers: (i) the probability of the most 
probable explanation until node number t — 1, assuming 
that the (t — l)-st node is set to 0; and (ii) the probability 
of the most probable explanation until node number t— 1, 
assuming that the (t — l)-st node is set to 1 . 

More specifically, let x t € {0, 1} be the assignment of 
the f-th node. In order to compute an MPE, we operate 
on two arrays, the D-a r ra y containing probabilities (such 
as Pr(a:*)) and the .4 -array containing states (such as 
x*). Let D t (x t ) be the maximal probability of a sequence 
of assignments to nodes Xi, . . . , X t , with X t = x t . Let 
A t _ i(x t ) be the assignment to the node X t -\ in the most 
probable sub-explanation that assigns X t = x t . 

For initialization in the TreeFDP algorithm, let 
Di(xi) = Pr(xi) for X\ € {0,1}- These D\ values are 
simply priors. Also set A 0 (xo) = {}. The recursive step 
is for the /4-arrav 

D t (x t ) = max {A-ifat-i) Pr(x t | x t -\)}, 

x t _i6{0,l} 

while for the /l-array we get the recursive step 

A t _i(x t _i) = arg max {D t _i(x t -\) Pr(x t | x t _i)}. 
x t _ie{o,i} 

The final assignments in the iterative loop are to 
Dt(Xt = 0) and Dt(Xt = 1) for probabilities, and to 
A t _i(X t = 0) and A T _ 1 {X T = 1) for states. 

Computing D T = ma x 2 , r6{0jl} {i4 T (X T = x T )} is 
one of the last steps of the algorithm. Here, we choose 
the best of the two probabilities Dt(Xt = 0) and 
D t (X t = 1). For the states, we similarly compute the 
last assignment At = arg max JTg { 0 x ^{D{Xt = Xt)}- 
Note that this is done after computing At-i(xt)- Given 
the information in the /l -array, one just needs to perform 
a backtracking step in order to compute the MPE x* for 
the chain [22, p. 264, Equation 35]. 

The TreeBDP algorithm is similar to TreeFDP. 
Again, there are two arrays, an array E containing 
probabilities, and an array B containing states. The E- 
array corresponds to the /4-array, while the B - array 
corresponds to the A-array. To save space we refer to 
the literature [36], [22] for further details. 


The following result, adapted from Rabiner [22], sum- 
marizes the performance of TreeBDP and TreeFDP. 

Theorem 1: In a BN that is a backward tree, TreeBDP 
computes an MPE x*. In a BN that is a forward tree, 
TreeFDP computes an MPE x*. 

Viterbi, generalized to arbitrary tree-structured graph- 
ical models, is called max-product belief propagation 
[2]. A tree-based reparametrization framework has been 
developed [3], [4], which includes the belief propagation 
[1] and sum-product algorithms [2] as special cases. 
Within this reparametrization framework, there are ap- 
proximation algorithms for MPEs [3] and marginals 
[4]. Our novel approaches to initialization, discussed in 
Section 3.2.2, are based on TreeBDP and TreeFDP and 
are closely related to the tree-based reparametrization 
algorithm for MPE computation. 

3 Stochastic Local Search (SLS) 

In this section we present our stochastic local search al- 
gorithm SGS in Section 3.1. Our initialization algorithms, 
the forward dynamic programming (GraphFDP) and 
backward dynamic programming (GraphBDP) algo- 
rithms, are discussed in Section 3.2. 

3.1 Stochastic Greedy Search 

Figure 1 presents our Stochastic Greedy Search (SGS) 
algorithm, discussed in more detail elsewhere [15], [17], 
[18]. The structure of SGS is similar to that of the seminal 
WalkSAT family of algorithms [20], [10]. In SGS, as in 
WalkSAT, the MAX-FLIPS restart parameter controls 
the number of flips made before SGS is restarted. A try 
starts with initialization and ends after at most MAX- 
FLIP flips. After MAX-FLIPS flips, if there is a new 
try, a new explanation x from which search restarts is 
randomly generated; this is also similar to WalkSAT. If 
exactly one initialization is performed at the beginning 
of each try, a total of at most MAX-FLIPS + 1 operations 
are performed per try. Two termination criteria are 
displayed in Figure 1; the lower bound l < Pr(a:*) 
and the MAX-TRIES parameter. The latter is easy to 
understand, for the former SGS terminates when the 
probability of the MPE estimate x* exceed the lower 
bound or I’rfx" ) > i. While we assume that £ is an 
input parameter, it can also be estimated during search. 
Other termination criteria can easily be introduced. 

There are, however, some important differences be- 
tween SGS and many other WALKS AT-style algorithms. 
First, SGS searches for an MPE in a BN, not a satisfying 
assignment in a CNF logic formula. Second, SGS com- 
bines greedy and noisy search algorithms in a portfolio 
S with stochastic initialization algorithms in a portfolio 
I, while WalkSAT does not use portfolios. Formally, we 
use the following definition (see [18]): 

Definition 6 (Stochastic portfolio): Let q > 1 and let 
<I> = cy } be a set of algorithms. A stochastic 

portfolio over 4> is a set of q tuples A = ... ,v q } = 

where 0 < p. t < 1, ELi Pi = T 
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SGS (P, l, I, S, MAX-FLIPS, MAX-TRIES) 

Input: f3 Bayesian network (BN) 

t lower bound: Pr(a:*) > t 

I initialization portfolio 

S local search portfolio 

MAX-FLIPS number of flips per try 
MAX-TRIES number of tries 
Output: (b, x*) bG {true,false}; 

x* is estimated optimum 

begin 


x* <— Initialize(I, j3) {initial MPE estimate x*) 

if (Pr(i:*) > £) then return (true, x*) 

i < — 1 

while (i < MAX-TRIES) 

j <— 0 {the Oth operation is this initialization} 
x <— Initialize^, /?) {initialize x) 
if (Pr(a:*) > l) then return (true, cc*) 


j 1 

while (j < MAX-FLIPS) 

x 4— SEARCH(a:, S, /?) {update x } 
if (Pr(a:) > Pr(ai*)) then x* <— x 
if (Pr(a:*) > £) then return (true, x*) 


j <~j + 1 

endwhile 
i <— i + 1 

endwhile 
return (false, x*) 

end 


Fig. 1. The stochastic local search algorithm SGS for 
computing MPEs. SGS operates in two main phases: an 
initialization phase and a local search phase. Initialize 
applies initialization algorithms from I, while Search 
applies search algorithms from S. SGS terminates if a 
high-probability explanation is found or if the number of 
tries exceeds MAX-TRIES. 


and {(f) means that the *-th algorithm <j) i , where 
1 < * < q, is picked (and executed) with selection 
probability pi when an algorithm is selected from A. 
Both I and § are portfolios according to this definition. 


3.2 Initialization Algorithms 

Initialization algorithms play an important role in SGS 
and other SLS algorithms [14], [15], [16]. We present 
below several initialization algorithms for stochastic gen- 
eration of initial explanations. SGS uses a portfolio I of 
initialization algorithm (or operators), and initialization 
of an explanation takes place when INITIALIZE^, 3) is 
invoked. The algorithms discussed in this section, as well 
as other initialization algorithms, can easily be incorpo- 
rated in the initialization portfolio I of SGS. Evidence 
e in a BN is not changed by any of these initialization 
algorithms, however in order to simplify the discussion 
we often do not make this explicit in the following. 


3.2. 1 Related SLS Initialization Algorithms 

Uniform initialization (UN), the basis for our SGS /UN 
experiments in Section 5, assigns initial states indepen- 
dently and uniformly at random to an explanation x. 
More formally, suppose that we have a Bayesian network 
with nodes X . The uniform initialization algorithm goes 
through all nodes X £ X. If X is a non-evidence 
node, each state x £ Qx has a probability of 1 / 1 f of 

being chosen to be part of the initial explanation x. An 
early approach to MPE computation by means of sto- 
chastic methods used uniform initialization and investi- 
gated three randomized search techniques: iterative local 
search, simulated annealing, and genetic search [21]. In 
general, SAT solvers have also employed initialization 
uniformly at random, and innovations have largely been 
made in the area of search heuristics. 

In the late 1990s, the benefit of more advanced SLS 
initialization algorithms when computing MPE became 
clear [13], [14], [15]. 

Forward simulation (FS), the basis for our SGS/FS 
experiments in Section 5, is a well-known BN simulation 
algorithm that operates as follows [37]: Suppose we have 
a Bayesian network where V represents the root nodes 
and C represents the non-root nodes. Using FS, a root 
node V £ V is initialized, i.e. its states chosen for 
inclusion in the initial explanation x , independently at 
random according to the prior distribution Pr(U). A non- 
root node C £ C is initialized randomly according to its 
conditional distribution Pr(C | lie)/ and only after all of 
its parent nodes lie have been initialized. Clearly, both 
UN and FS are O(n). 

Kask and Dechter empirically found strong MPE per- 
formance using greedy search combined with stochas- 
tic simulation [1] after performing initialization using 
the mini-bucket approximation algorithm [13]. This 
algorithm [38] approximates bucket elimination and is 
useful for problem instances with large induced width 
(or treewidth), since bucket elimination has exponential 
space and time complexity in induced width [39]. 

Developing an SLS approach for the MAP problem, 
which generalizes the MPE problem for BNs, Park 
and Darwiche investigated two search algorithms (hill- 
climbing and taboo search) as well as four initialization 
algorithms [19]. Since the MAP problem is strictly harder 
than the MPE problem [19], they in fact use MPE com- 
putation as an initialization algorithm. 

3.2.2 Novel Dynamic Programming Algorithms 

We now turn to GraphFDP and GraphBDP. Unlike 
TREEBDP and TreeFDP, GraphFDP and GraphBDP 
can handle arbitrary BNs and not only trees and chains. 
Both these algorithms split a BN /? into trees, initialize 
each tree independently using Viterbi (see Section 2.2), 
and then collect the sub-explanations for all trees to give 
an explanation for the entire BN 3. More formally, let X 
be the nodes in a BN. These nodes are partitioned into 
k partitions 


X = TiU.-.U T fc , 


( 2 ) 
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Fig. 2. Backward dynamic programming (GraphBDP) 
initialization in a non-tree Bayesian network. To the 
left, the BN is decomposed, by GraphBDP, into two 
backward trees TT = {Xi,X 3 ,X 4 ,X 5 ,X 6 } and T 2 = 
{X 2 ,X 7 }. To the right, the BN is decomposed, again 
by GraphBDP, into two different backward trees Ti = 
and T 2 = {X 2 ,X 4 ,X 5 ,X 6 ,X 7 }. 


where Ti n Tj=0 for i / j. All partitions T,, where 
1 < 1 < are either forward trees or backward trees. 

Without loss of generality, we now assume that 
GraphBDP is used and that all T, u where 1 < i < k, 
therefore are backward trees. Consider the i-th backward 
tree T, = {X h , . . . ,X iN }, where X tj G X. For T, : , 
TreeBDP computes the sub-explanation 

, Xi 1 , . . . , Xi N }, (3) 

which is an MPE for T, according to Theorem 1 . An MPE 
estimate x* for the complete BN 3 can now be generated 
by GraphBDP simply by collecting sub-explanations for 
all k backward trees: 

x* = x\ U . . . U x* k . (4) 

The GRAPHBDP algorithm is presented in Figure 3. 

In constructing the trees {Ti, . . . , T k }, some way to 
introduce randomness is clearly needed in GraphBDP. 
That is, we do not want every explanation generated 
using INITIALIZE to be the same. Randomness is intro- 
duced by constructing depth-first search trees where root 
nodes, and then the children of any node, are picked 
for processing uniformly at random. We call this novel 
approach stochastic depth-first search, STOCHASTICDFS, 
and summarize it as follows. 

Let 3 = (X, E, P) be a BN, V G X a root node, 
and e evidence. The algorithm StOCHASTICDFS(/3, V. e) 
performs a depth-first search where all non-visited chil- 
dren of V are recursively visited uniformly at random. 
STOCHASTICDFS outputs a backward tree T, rooted in 
V and marks all nodes in T, as visited in 3. In T,, 
all nodes reachable from V along a directed path in 
3 are included, except those nodes in 3 that are part 
of backward trees {Tf, . . ., T,.. | } already created by 
STOCHASTICDFS and thus already marked as visited. 

In its first while-loop, GraphBDP decomposes a BN 
3 into trees, starting from the root nodes, resulting in a 
forest of trees. Each tree T in the induced forest of trees 
is then, in the second while-loop, input to TreeBDP, 
and an MPE for T is thus computed. Given the com- 


GRAPHBDP(/3,e) 

Input: 3 Bayesian network (BN) 

e evidence 

Output: x explanation 

begin 

V <— root nodes in 3 

F <— 0 {initialize the forest of backward trees F } 
while V f 0 {treat all root nodes} 

V <— pick random root node from V 
V^V-{V} 

{construct backward tree T with root V :} 

T STOCHASTlcDFS(/3, V, e) {see text} 
Ft-FU{T} {update forest F of trees} 

endwhile 

x 4— 0 {initialize the explanation x} 

V <— root nodes in 3 

while V f 0 {treat all root nodes} 

V <— pick node from V 
V^V-{Vj 

T <— pick, from F, the tree with root V 
y <— TreeBDP(/9, T, e) 
x <— x U {y} {update explanation} 

endwhile 
return x 

end 

Fig. 3. The dynamic programming algorithm that itera- 
tively creates backward trees for a BN and then executes 
the Viterbi algorithm separately on each of these trees. 


bined operation of STOCHASTICDFS and GraphBDP, 
we can show the following. 

Theorem 2: Let 3 be a BN and e = {Xi = x\, . . ., 
X m = x rn } the evidence. The GraphBDP(/ 3, e) algo- 
rithm creates a forest of trees {Ti, . . . , T k } in which each 
node X G X of 3 = (X, E, P) participate in exactly one 
tree T where 1 < i < k. 

Proof: We first show that any node must be a mem- 
ber of at least one tree. In 3 = (X, E, P), where VCX 
are root nodes, suppose for the purpose of contradiction 
that X G X is not in any tree. Obviously, X is either a 
BN root node, X G V, or X is a BN non-root node, 
X G X - V. Case (i): If X G V, it is the root of 
exactly one tree by construction of GraphBDP and there 
is a contradiction. Case (ii): If X is a non-root node, 
X G X — V, it is reachable by STOCHASTICDFS from 
one or more root nodes, say {Vi ,V m j C V, through 
its parent nodes Fix- Some Vi G { V -\ ..... V m } must have 
been the first to have been picked in the first while-loop 
of GraphBDP. At that point, by construction of STO- 
CHASTICDFS, X would eventually have been included 
in Vi's tree, thus giving a contradiction and proving that 
any node must be a member of at least one tree. Now 
we show that X G X cannot be member of two different 
trees. Suppose, for the purpose of contradiction, that 
X G Ti and X G Tj with respective root nodes V, and 
Vj for 1 / j. The only non- trivial case is X f V and 
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X ^ Vj. Without loss of generality, we assume that V, 
was picked before Vj from V in the first while-loop of 
GraphBDP, so i < j. When V was picked, and since 
X € Tj, there must be a path from Vj to X and thus X is 
marked visited upon inclusion in T, . By construction of 
STOCHASTICDFS, X cannot be included in another tree 
Tj for j > i, giving a contradiction. An evidence node Xi 
in e is treated exactly like any other node except that it's 
state Xi is not changed. We have shown that any node 
must be a member of at least one tree and cannot be 
member of two trees; three or more trees follows easily 
in a similar manner, proving the claim. □ 

The second while-loop in GraphBDP invokes 
TreeBDP for each backward tree T,. For each T,, 
TreeBDP then sets up the dynamic programming ar- 
rays, starting from the leaf nodes. The arrays are one 
numerical array E and one state array B as discussed in 
Section 2.2. Then TreeBDP constructs a sub-explanation 
x* — as shown in Equation 3 — by forward propagation 
and back-tracking over these DP arrays. Finally, all sub- 
explanations are collected to form x *; see (4). 

Example 1: Figure 2 illustrates, for a small BN, how 
GraphBDP may decompose a BN into two different 
backward trees. 

Example 1 illustrates the following general result, 
which follows immediately from Theorem 2 and our 
discussion above. 

Corollary 3: Consider a BN /3 = (X, E, P) with 
X = {X\, . . .,X n } and evidence e = {X = x\, . . ., 
X rn = x m } where m < n. GraphBDP(/ 3, e) computes 
an explanation y over all non-evidence nodes Y = 
{X m +i, . . X n } c x. 

GRAPHFDP works similar to GraphBDP, but starts 
its stochastic decomposition of the BN into a forest of 
forward trees from the leaf nodes, constructs DP arrays 
from the root nodes, and then propagates and back- 
tracks. Both GraphFDP and GraphBDP have complex- 
ity 0(n), since each node X g X is processed at most 
once in both algorithms. 

Both GraphBDP and GRAPHFDP are heuristics and 
have limitations. At the same time, this way of gen- 
eralizing beyond the cases of chains and trees is fast 
and produces, as it turns out, good explanations for 
certain tree-like BNs. Since trees are generated in a ran- 
domized fashion, different sub-explanations {x\, . . . , Xk} 
which are MPEs for the individual trees { T\ ..... Tf. } are 
constructed and aggregated to form different candidate 
MPEs. For trees we will get an MPE x*\ generally 
an explanation constructed in this manner is an MPE 
estimate x* . Experimentation is needed to evaluate their 
quality, and we return to this in Section 5. 

4 Theoretical Framework for SLS 

SLS algorithms often have highly variable run times 
depending on the initial explanation and may also be 
restarted at different points of their execution. In this 
section, we carefully analyze SGS, in particular with 
regard to initialization and restart. 


4.1 Fundamentals of SLS Search 

A number of non-negative random variables can be in- 
troduced to characterize the (pseudo-)random behavior 
of SLS algorithms including SGS. Let, for a try, the 
number of initialization operations applied be a random 
variable X and the number of search operations (flips, 
either greedy or noisy) applied be a random variable Y. 
The total number of operations applied in a try is then 
a random variable Z = X + Y. 

While SGS contains an initialization portfolios I, 
we investigate general initialization portfolios in a 
related article [17] and focus here on homogenous 
initialization portfolios. These portfolios, containing a 
single initialization algorithms, have the form I = 
{(ai,0), . . . , (dj, 1), . . . , (ci£,0)}, which can be abbreviated 
I = {{aj, 1)} = {(a, 1)} or SGS/ a. In addition, we want 
to make the value of the restart parameter MAX-FLIPS 
= to explicit. To reflect these two points, we say Z(a, to) 
and Y(a,m) rather than just Z and Y respectively. 1 We 
now obtain the expected number of operations in a try 

E(Z(a, m)) = E(X) + E(Y(a, m )) = 1 + E(Y (a, m )), (5) 

where the last equality holds for SLS algorithms that 
perform exactly one initialization per try, as will be 
assumed in the following. The notation m = oo means 
that there is no restart. 

SGS stops a try when an explanation x* such that 
Pr(x*) > i, which we abbreviate x f , is found or when 
the cutoff MAX-FLIPS is reached. In the former case 
we say that the try is successful, and define success 
probability p s (a, to) accordingly. For the latter case we 
define failure probability pf(a, to). 

Definition 7 (Success, failure of try): Let MAX-FLIPS = 
m. The success probability of an SLS try is 

m 

p s (a, to) := Pr(Y (a, oo) = i). (6) 

i=0 

The failure probability is pf{a,m) := 1 — p s {a, to). 

An MPE estimate x\ might be computed without 
much search, if initialization is strong relative to the 
problem instance. Therefore, summation starts with i = 0 
in (6) to capture the probability of finding an x\ as 
a result of initialization. When there is a good fit be- 
tween a problem instance and an initialization algorithm, 
Pr(F (a, oo) = 0) can in fact be substantial, as we will see 
in experiments in Section 5. 

The expected number of SGS operations executed, 
E(Z(a, to)) is characterized by the following result. The 
assumption p s (a, to) > 0 made below is reasonable since 
if p s (a, to) = 0 then using MAX-FLIPS = m is futile. 

Theorem 4 (Expected number of operations): Suppose 
that p s (a, to) > 0 and let MAX-FLIPS = to. The expected 

1. We could also have added SGS's other input parameters to Z and 
Y , however this would have made for a too tedious notation for the 
purpose of this article, where we focus on initialization and restart. 
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number of SLS operations executed during a run, 

E(Z(a,m)), is given by 


E(Z{a, to)) = 


mp f (a, in) + YhLq * Pr(D (a, oo) = i ) + 1 


p s (a,m) 


(7) 


Proof: We introduce an indicator random variable R, 
and let R = 1 mean SLS success in the first try while R = 
0 means SLS failure in the first try Using conditional 
expectation gives 


E(Z(a, to)) = E(Z(a, m)\R= 0) Pr(f? = 0) 

+ E(Z(a, m) \ R = 1) Pr(f? = 1). (8) 


First, we consider the case where SLS terminates in the 
first try and obtain 


E(Z(a,m) \R = 1) 


i i ^ Pr ( r °°) = ^ 

Ps{a,m ) 


(9) 


where the first term is due to initialization and the 
second term is due to flips and is normalized using 
p s (a,m) since we assume that SLS terminates in this 
try Second, we consider the case where SLS does not 
terminate in the first try and obtain E(Z(a,m) It = 0) 
= E(Z(a,m) + to + 1), which by linearity is 


E(Z(a, to) | R = 0) = E(Z(a, to)) + to + 1. (10) 


Substituting (9), (10), Pr(i? = 0) = pf(a,m), and Pi(R = 
1) = p s (a,m) into (8) and then solving for E(Z(a,m)) 
gives the desired result for a run (7). □ 

Results similar to Theorem 4 have been obtained and 
discussed previously [28], [40]. Theorem 4 is novel in that 
we account for all operations including initialization, 
not only flips Y(a,m ) as earlier [28], [40]. 2 This is 
important when initialization makes up a significant part 
of total computation time, which is the case when small 
values of MAX-FLIPS are optimal or close to optimal. In 
addition, previous analytical results do not contain an 
explicit proof as stated above [28], [40]. 

The distributions of Z(a,m) and Y(a,m ) are also 
known as run length distributions (RLDs) [12]. 3 A re- 
lated random variable, which we will denote C(a, in), 
measures wall-clock time in, say, milliseconds, and is 
known as the run time distribution (RTD). Given the 
additional variability introduced in RTDs, due to em- 
pirical measurements that depend on the software and 
hardware used in addition to the random number of SLS 
operations, we mainly use RLDs in this article. 

The performance of SLS algorithms varies dramati- 
cally as the value of the restart parameter MAX-FLIPS 
varies. Flow can MAX-FLIPS, then, be set such that SLS 
performance is optimized? In answering this question. 


2. Notation used previously is slightly different from ours. Instead 
of E(Y(m)), Parkes and Walser use -EL.m, where v represents a 
problem instance and MAX-FLIPS = m [28]. Schuurmans and Southey 
say E(T ) with restart after f flips instead of E(Y(m)) [40]. 

3. A more descriptive terminology for Y(m) would perhaps be "flip 
length distribution", but we will use the more common "run length 
distribution" in this article. 


one needs to consider what, exactly, to optimize. It 
is clearly reasonable to minimize the expected number 
of SLS operations, and we consequently introduce the 
following definition. 

Definition 8 (Expectation-optimal MAX-FLIPS): Let 
MAX-FLIPS = to and let Z(a,m ) be the random 
number of SLS operations. The expectation-optimal 
value to* for MAX-FLIPS is defined as 

m*(a) = axgmin (E(Z(a, to))) , (11) 

m6N 

with p* 0 (a) = E(Z(a,m*)), often abbreviated in* and /t* 
respectively. 

Occasionally, we minimize the expected number of 
SLS flips E(Y(a,m)) instead of minimizing E(Z(a,m)) 
as above, and define to/) (a) in a similar manner to m*(a) 
and also define p*f(a) = E{Y{a,mf)). 

4.2 Finite Mixture Models of SLS 

We now discuss finite mixture models, which have at 
least two benefits in the SLS setting. First, finite mixture 
models turn out to be a reasonable model of the multi- 
modal nature of SLS run-length distribution (RLD) in 
many instances [26]. Second, finite mixtures allow us to 
improve the understanding of the role of restarts in SLS. 

4.2. 1 General Case of Finite Mixture Models 

In order to make further progress, and supported by 
previous research [26] as well as experiments in Section 
5, we now introduce the assumption that an RLD may 
be characterized as a finite mixture of n components. 

Definition 9 (SLS finite mixture model): Let MAX-FLIPS 
= to, assume a homogenous initialization portfolio 
{(a, 1)}/ an d let F., (for 1 < i < n) be a cumulative 
distribution function. An SLS finite mixture model of 
the RLD Z (a, to) is defined as 

F(z; a, to) = 7Ti(a, m)Fi(z ; a, m)+- • -+7r K (a, m)F K (z\ a, to), 

(12) 

where Yl= i n z (a,m) = 1. 

Without loss of generality, we assume that the distribu- 
tions F- t , ... . F k are ordered according to their respective 
means: p 1 < p 2 — • • • < I j k- Further, when no restart 
is used we may say m(a)Fi(z; a) and Z(a), and when 
the initialization algorithm a is clear from the context or 
assumed fixed we may say 7r i(m)Fi(z;m) and Z(m). 

An interesting special case of (12) is to model an 
SLS run length distribution as a two-component finite 
mixture distribution 

F(z; a, in) = ni(a, m)Fi(z; a, to) + ^(a, m)Fo{z ; a, in). 

(13) 

Now, for the case i — Pr(a:*), ni and F-\ represent ini- 
tializations that are close to the MPEs X*, while tt 2 and 
F 2 represent initializations farther away from X*. The 
characteristics of 7 ^, F \ , 7r 2 , and F 2 will depend on the 
RLD at hand, which again depends on the BN, the SLS 
initialization algorithm, and the other SLS parameters. 
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The idea is that a strong initialization algorithm yields a 
larger 7Ti and an h\ that is more skewed towards shorter 
run lengths compared to a weak initialization algorithm. 

The formalization above, using finite mixtures, im- 
proves our understanding of the following well-known 
strategy: Initialize and then run the SLS algorithm for 
a "small" number of steps. If the initial explanation 
x turned out to be "close" to an optimum, then the 
SLS algorithm exhibits the ttiFi(z) case. If the initial 
explanation x turned out to be "far away" from an 
optimum, then we have the Tv^F^iz) case. In (13), the 
term ttiFi(z) will in general be of greatest interest to us 
since it represents successful initializations. Intuitively, 
the idea is to set MAX-FLIPS such that the SLS algorithm 
uses, in (13), the F\ distribution repeatedly rather than 
"wait around" till the F 2 distribution takes effect. 


4.2.2 Special Cases of Finite Mixture Models 

The specific mixture models we discuss below are for 
continuous random variables, while in Section 4.1 we 
considered discrete random variables. Consequently, we 
note that E(Z(m)) (see Theorem 4) can be approximated, 
using continuous random variables, as 




m (1 - F{m)) + f n m zf{z)dz + 1 
F(m) 


(14) 


where F{z) is a cumulative distribution function and 
f(z) the corresponding probability density function. 

We now discuss finite mixtures of exponential distri- 
butions, which is of interest for several reasons. First, 
RLDs that are close to exponential have been observed in 
experiments [23], [26], [12], For example, an empirical 
RLD for a hard SAT instance containing 100 variables 
and 430 clauses was generated using WalkSAT and 
found to be well-approximated by an exponential distri- 
bution with fi = 61,081.5 [12]. Second, the exponential 
distribution, and its discrete counter-part the geometric 
distribution, are memoryless. Restarting SLS algorithms 
whose RLDs almost follow these distributions is there- 
fore of little use [40], and they are thus interesting as 
bounding cases for restart. Third, it is relatively easy to 
analyze the exponential distribution. 

The exponential distribution has only one parameter 
and its mode is at zero. The probability of finding x“ e 
within zero or a few flips is, on the other hand, often 
extremely small (see Figure 4 and Figure 6 for exper- 
imental evidence). For such situations, the normal (or 
Gaussian) and log-normal distributions, and their finite 
mixtures, may be more suitable and are used extensively 
in Section 5, see Table 1 and Table 2. A random variable 
is log-normal if its logarithm is normally distributed. 
The normal and log-normal distributions both have 
two parameters, controlling location and spread, making 
them well-suited to handling the type of right-shifting 
needed to model the low-probability left tails that can be 
empirically observed for SLS algorithms including SGS. 


5 Experiments with SLS 

We now consider experimental results for SGS. Section 
5.1 discusses our experimental approach. In Section 5.2 
we characterize the behavior of different SLS initial- 
ization algorithms, including our novel Viterbi-based 
approach, using finite mixture models and other tech- 
niques. In Section 5.3 and Section 5.4 we empirically 
investigate the effect of restarts, and in particular vary 
and optimize the MAX-FLIPS parameter in the context 
of different initialization algorithms. We compare the 
performance of SGS and clique tree clustering, a state- 
of-the-art exact method, in Section 5.5. 

5.1 Experimental Methodology 

The main purpose of the empirical component of this 
research is scientific experimentation rather than compet- 
itive experimentation [41]. 4 While we have highlighted 
the importance of initialization and introduced a novel 
Viterbi-based initialization approach, the purpose of our 
experiments is not to show that this approach is faster 
than existing algorithms on all problem instances (com- 
petitive experimentation). Rather, we aim to complement 
the discussion earlier in this article (scientific experimen- 
tation), and in particular provide further details regard- 
ing the effect of using different initialization algorithms 
and varying MAX-FLIPS. 

There is evidence that a problem instance that is hard 
for one SLS algorithms is also hard for other SLS algo- 
rithms [24], therefore we here investigate one SLS system 
in depth instead of performing superficial experiments 
with a large number of SLS systems. The SLS system 
used for experimentation was an implementation of Sto- 
chastic Greedy Search (SGS) [14], [15]. Initialization algo- 
rithms discussed in Section 3 were used in I , namely UN 
- based on uniform initialization; FS - based on forward 
simulation; BDP - based on GraphBDP; and FDP - 
based on GraphFDP. Instance-specific search portfolios 
S were used in the experiments. These § always con- 
tained noisy search algorithms with non-zero selection 
probabilities (see Definition 5.5), thus SGS could always 
escape local but non-global maxima even for MAX-FLIPS 
= 00 . Input parameters f = Pr(a:*), e = {}, and MAX- 
TRIES = 00 were given to SGS (except in Section 5.5 
where MAX-TRIES < 00 was used). The benefit of MAX- 
TRIES = 00 , often called the Las Vegas approach [42], 
[43], [12], is that one does not confound empirical results 
with the question of when to terminate. Using SGS, at 
least 1,000 repetitions (with different random seeds) were 
performed for each experiment reported here. In some 
cases, up to 10,000 or 100,000 repetitions were run. 

We investigate the performance of SGS on BNs from 
applications; see elsewhere for synthetic BN results 
[17], [18]. The 10 distinct application BNs considered, 
most of which are taken from Friedman's Bayesian 

4. Hooker in fact uses the term "testing" rather than "experimenta- 
tion" [41]; however we here prefer the latter term. 
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Network Repository, are shown in Table 5. (At the 
time of this writing, the location of the Bayesian Net- 
work Repository is at http://www.cs.huji.ac.il/labs/ 
compbio/Repository/.) We discuss in greatest detail the 
Muninl, Pir3, and Water BNs. The Munin BNs are from 
the medical field of electromyography. The Pir BNs 
perform information filtering for the purpose of improv- 
ing battlefield situation awareness [44]. The Water BN 
models the biological processes of water purification. 

An Intel Pentium 4 2GHz CPU with 1GB of RAM, 
running Windows XP, was used in experiments. 

5.2 Varying the Initialization Algorithm 

The purpose of the initialization experiments was to 
study in detail the effect, on empirical RLDs Z(a, oo), or 
Z(a), and Y(a, oo) = Z(a, oo) — 1, or Y{a), of using dif- 
ferent initialization algorithms. Three variants of SGS /a, 
namely SGS/UN, SGS/FS, and SGS/BDP, were used 
[14], [15]. Each SGS variant was tested using three BNs 
— Muninl, Pir 3, and Water — giving a total of nine 
combinations. For each combination, 1,000 repetitions 
were executed. Following standard SLS methodology, 
no restarts were done at this stage [26], [12]. 

Results of these experiments are presented in Figure 
4, Table 1, and Table 2. Table 1 and Table 2 were created 
using the WEKA implementation [45] of the expectation 
maximization (EM) algorithm. WEKA was used in two 
different modes, namely (i) to compute the optimal num- 
ber of finite mixture components n* by means of cross- 
validation (see Table 1) and (ii) to compute the mixture 
parameters for a fixed number of mixture components 
(see Table 2). Since normality is assumed by this variant 
of EM, the ln-transformation performed as indicated 
in Table 1 means that log-normality is implied. These 
experiments are complementary to previous experiments 
by Hoos, where finite mixtures were used to model SLS 
performance on SAT instances from the hard region [26]. 
In particular, Hoos' experiments focused on synthetic 
SAT instances rather than application BNs and used at 
most two mixture components in experiments [26]. 

For the Muninl BN, SGS/FS is the best performer. On 
average, SGS/FS, uses only 32% of the flips required for 
SGS/UN in order to find an MPE in these experiments. 
By investigating the RLDs in Figure 4, it becomes clear 
that SGS/FS has a significant portion of much shorter 
searches than SGS/UN and SGS/BDP. This is reflected 
in Table 1, where the left-most mixture component has 
for the raw data 7Ti = 0.43, jl 1 = 21.84, and hi = 21.84. 
For SGS/UN and SGS/BDP, there is no similar effect of 
short search lengths. The 25th percentile is 87,215 flips 
for SGS/UN and 65,215 flips for SGS/BDP. In Table 2, 
there is for SGS/FS and Muninl a prominent drop in the 
log-likelihood from k = 1 to re = 2, indicating that much 
RLD probability mass is quite accurately accounted for 
when there are two components in the mixture. 

For the Pir3 BN, SGS / BDP is extremely strong. In fact, 
in over 50% of the cases an MPE is found as a result 



1 100 10,000 1 , 000,000 

Operations 


SGS, Pir3 BN, MAX-FLIPS unbounded 



SGS, Water, MAX-FLIPS unbounded 



Fig. 4. Varying the initialization algorithms for three BNs. 
For each BN, three different variants of SGS are tested, 
namely SGS/UN, SGS/FS, and SGS/DP. An empiri- 
cal run-length distribution (RLD) is displayed for each 
combination. Each BN has a best initialization algorithm, 
namely SGS/FS for Muninl (top), SGS/BDP for Pir3 
(middle), and SGS/FS for Water (bottom). 


of initialization (zero flips). Table 1 reflects this strong 
performance, with 7Ti = 0.74 and /% = 0.46. In contrast, 
the averages for Pir3 for SGS/UN and SGS/FS are 
E(Y(a, oo)) = 34,469 flips and E(Y(a, oo)) = 32,249 flips 
respectively. Pir3's backward tree-like structure appears 
to explain the strong results for SGS/BDP. 

For the Water BN, SGS/FS is on average approxi- 
mately twice as fast as SGS/UN and SGS/BDP. For 
SGS/FS, we see in Figure 4 a rapid increase in the 
empirical RLD until approximately 10 operations. In 
fact, for Water the 1st percentile is 3 flips and the 5th 
percentile is 6 flips. Table 1 and Table 2 reflect similar 
trends for SGS/FS compared to its alternatives. 

These experiments show that initialization algorithms 
can have a major positive impact on the SGS inference 
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BN 

Init. 

Ln 

K* 

7T1 

Ai 

O-l 

7T2 

A2 

02 

LL* 

AIC* 

Muninl 

UN 

n 

5 

0.22 

58,206.54 

21,709.53 

0.31 

113,821.39 

25,923.72 

-12.81 

53.61 

Muninl 

UN 

y 

2 

0.31 

11.45 

0.80 

0.69 

11.93 

0.57 

-1.03 

12.06 

Muninl 

FS 

n 

8 

0.43 

21.84 

7.00 

0.07 

979.74 

766.79 

-9.15 

64.30 

Muninl 

FS 

y 

5 

0.43 

3.04 

0.32 

0.03 

5.14 

1.43 

-1.88 

31.75 

Muninl 

BDP 

n 

5 

0.26 

51,803.76 

16,944.68 

0.36 

96,441.14 

27,014.92 

-12.62 

53.25 

Muninl 

BDP 

y 

1 

1.00 

11.58 

0.69 




-1.05 

6.09 

Pir3 

UN 

n 

5 

0.21 

8,311.19 

3,047.46 

0.35 

20,489.07 

7,239.97 

-11.36 

50.72 

Pir3 

UN 

y 

3 

0.36 

9.25 

0.57 

0.37 

10.25 

0.39 

-1.24 

18.49 

Pir3 

FS 

n 

6 

0.22 

7,830.90 

2,911.51 

0.25 

16,735.10 

4,370.79 

-11.29 

56.58 

Pir3 

FS 

y 

2 

0.54 

9.53 

0.71 

0.46 

10.65 

0.58 

-1.26 

12.51 

Pir3 

BDP 

n 

2 

0.74 

1.29 

0.46 

0.26 

4.16 

2.72 

-1.48 

12.96 

Pir3 

BDP 

y 

1 

1.00 

0.47 

0.61 




-0.92 

5.85 

Water 

UN 

n 

8 

0.18 

169.94 

110.59 

0.26 

748.22 

332.40 

-9.00 

64.00 

Water 

UN 

y 

5 

0.06 

3.79 

0.40 

0.23 

5.69 

0.66 

-1.77 

31.54 

Water 

FS 

n 

4 

0.20 

14.10 

11.58 

0.36 

320.87 

232.10 

-8.08 

38.17 

Water 

FS 

y 

5 

0.20 

2.28 

0.66 

0.29 

5.15 

0.73 

-2.08 

32.15 

Water 

BDP 

n 

8 

0.17 

441.78 

217.74 

0.31 

1,252.41 

411.89 

-9.04 

64.09 

Water 

BDP 

y 

2 

0.44 

7.00 

1.08 

0.56 

8.01 

0.84 

-1.48 

12.96 


TABLE 1 

Mixture models computed using the EM algorithm over SGS run length data, using three different initialization 
algorithms and three different BNs. Cross validation results for both raw data and ln-transformed data are displayed 
(Ln column), as are the number of mixture components (k*), the statistics for the one or two left-most mixture 
components, as well as the log-likelihoods (LL*) and Akaike information criterion (AIC*) values. 


BN 

Init. 

Ln 

K — 1 

K = 2 

K = 3 

K* 

LL* 

Muninl 

UN 

n 

-13.02 

-12.85 

-12.82 

5 

-12.81 

Muninl 

FS 

n 

-12.82 

-9.50 

-9.31 

8 

-9.15 

Muninl 

BDP 

n 

-12.94 

-12.70 

-12.64 

5 

-12.62 

Pir3 

UN 

n 

-11.65 

-11.44 

-11.38 

5 

-11.36 

Pir3 

FS 

n 

-11.65 

-11.40 

-11.33 

6 

-11.29 

Pir3 

BDP 

n 

-2.06 

-1.48 

-1.98 

2 

-1.48 

Water 

UN 

n 

-9.71 

-9.28 

-9.11 

8 

-9.00 

Water 

FS 

n 

-9.41 

-8.60 

-8.30 

4 

-8.08 

Water 

BDP 

n 

-9.60 

-9.20 

-9.11 

8 

-9.04 


TABLE 2 

Mixture models generated using the EM algorithm over 
SGS run length data, using three different initialization 
algorithms, for the BNs Muninl, Pir3, and Water. 
Log-likelihoods (LLs) for k = 1, 2, 3 mixture components 
are shown. The two right-most columns show the 
number of mixture components n* and log-likelihood LL* 
computed by cross validation. 


time. For each of these BNs, there is one initialization 
algorithm that substantially outperforms the traditional 
approach of initializing uniformly at random, here im- 
plemented in SGS/UN. For some initialization algo- 
rithms, a non-trivial percentage of searches turned out 
to be relatively short as shown in Figure 4. For Pir3, 
SGS/BDP has a 99th percentile of 9 flips; for Muninl, 
SGS/FS has a 25th percentile of 21 flips; and for Water, 
SGS/FS has a 25th percentile of 91 flips. 

These experiments exhibit clear evidence of mixture 
distribution behavior, aligning well with the discussion 
in Section 4.2. We see in Table 1 that the optimized 
number of mixture components, n* , ranges from 1 to 
8. In particular, in Table 1, the use of two or more 
mixture components was found to be optimal in most 
cases: k* > 2 in 16 out of 18 rows in the table. 


5.3 Varying the Restart Parameter 

The purpose of the restart experiments was to investi- 
gate the effect, on empirical RLDs Y(a,m), of varying 
the SLS restart parameter MAX-FLIPS. The MAX-FLIP 
parameter was set to MAX-FLIPS = 20, MAX-FLIPS = 30, 
MAX-FLIPS = 50, MAX-FLIPS = 100, and MAX-FLIPS 
= 200, and for each condition the MPE was computed 
1,000 times using SGS while gathering statistics for RLD 
estimates Y(a,m). 

In Figure 5, the Y(a,m) results for SGS/FS using the 
Muninl BN are shown. Varying MAX-FLIPS has a major 
impact, since for the unbounded case MAX-FLIPS = oo 
the RLD is almost horizontal for RLD-values slightly 
greater than 0.4. For the larger values of MAX-FLIPS, 
except for the unbounded MAX-FLIPS case, we observe 
a stair-case-shaped RLD, where steps get progressively 
smaller as the number of flips increases. Clearly, the 
reason for this pattern is the substantial probability mass 
in the left tail. This is also shown in Table 1, where the 
left-most mixture component has for SGS/FS 7Ti = 0.43, 
/h = 21.84, and a 1 = 7.00. 

Results for Y(a,m) for SGS/FS using the Water BN 
are shown in Figure 5. While the effect here is not 
as dramatic as for Muninl, MAX-FLIPS = 20 gives, in 
general, visibly better performance than the other MAX- 
FLIPS settings. Again, this behavior reflects Table 1, 
where the left-most mixture component has for SGS/FS 
7r i = 0.20, /t-L = 14.10, and b\ = 11.58. 

For Pir3, varying the restart parameter had a minimal 
effect on Y (a, m), as one might expect given the similar- 
ity of the RLD to an exponential distribution (see Figure 
4), and to save space we do not report details. 

In summary, we have seen that varying the MAX- 
FLIPS parameter can have a substantial impact on SGS 
run lengths for certain BNs. This leads to the following 
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♦ MAX-FLIPS=20 
□ MAX-FLIPS=30 
a MAX-FLIPS=50 

MAX-FUPS=100 

* MAX-FUPS=200 
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SGS/FS, Water BN 



♦ MAX-FLIPS=20 
□ MAX-FLIPS=30 
a MAX-FLIPS=50 

MAX-FLIPS=100 

* MAX-FLIPS=200 
o MAX-FLIPS unbounded 


SGS/FS, Muninl BN, Mixture models 



Fig. 6. Empirical run length distribution (RLD) for 
SGS/FS for Muninl along with four mixture models, with 
varying number of mixture components. 



Baseline SGS 

Optimized SGS 

Speedup ratio 

BN 

Flips 

MF 

Flips 

MF 

Flips 

Time 

Muninl 

162,373 

oo 

61.09 

30 

2,680:1 

2,656:1 

Pir3 

34,469 

CO 

0.8621 

6 

39,982:1 

6,629:1 

Water 

3,121 

oo 

70.59 

9 

44.2:1 

38.34:1 


Fig. 5. Empirical run length distributions (RLDs) for SGS 
when varying the restart parameter from MAX-FLIPS=20 
to MAX-FLIPS unbounded. 


question. What are the optimal or near-optimal values of 
MAX-FLIPS for such BNs? This issue is what we turn 
to in the experiments of the next section. 

5.4 Optimization of Initialization and Restart 

The purpose of the optimization experiments was to 
empirically find optimal or near-optimal values of 
MAX-FLIPS, thus minimizing E(Y(a,m)). The speed- 
up that can be obtained by optimizing both initialization 
and restart was of interest. Compared to traditional 
optimization, we note that SLS restart parameter opti- 
mization is complicated by the noise in the objective 
function caused by SLS randomization. 

Based on pilot studies reported in Section 5.3, we 
investigated optimization of SGS for Muninl, Pir3, and 
Water in detail. Using SGS/FS and Muninl, MAX- 
FLIPS was varied from MAX-FLIPS = 20 to MAX-FLIPS 
= 40. For each setting, 10,000 Las Vegas run length 
experiments were performed. Both the number of flips 
and computation times were recorded, and then the 
sample averages for number of flips E(Y(a,m )) and 
execution times E(C(a,m)) were computed. 

Results from these Muninl experiments are shown in 
Figure 7. Figure 7 contains, in addition to the data points, 
4-th order polynomial regression results. Similar results 
are reported for the Water BN. Given the approximate 
polynomials above one can find approximate optima by 
setting the derivative to zero, fix) = 0, and solving for 
x, giving results that are very similar to those obtained 


TABLE 4 

Optimizing performance of SGS for three BN instances. 


using just run length. Further insight and confirmation 
is provided by Figure 6 and Table 3, which show raw 
data along with normal mixture models. Using the prob- 
abilistic approach discussed in Section 4, in particular 
(14), these mixture models were used to obtain estimates 
m*(FS) as summarized in Table 3. 

A summary of the results for the different BNs is 
provided in Table 4. The baseline SGS system is using 
uniform initialization and no restart. Optimized SGS is 
using the best initialization algorithm (for that particular 
BN) and an optimized value for MAX-FLIPS (MF). For 
the Pir3 BN, the mean of the baseline approach SGS / UN 
was 34,469 flips while optimized SGS/BDP used 0.8621 
flips. In terms of wall-clock time, optimized SGS/BDP 
was 6,629 times faster than unoptimized SGS/UN. 
There are similar, but less dramatic, speedups for the 
Muninl and Water BNs. 

Overall, the speedups obtained by carefully optimiz- 
ing initialization and restart are quite dramatic for these 
application BNs. Initialization is more important than 
restart, in the sense that strong performance can be 
obtained for a particular initialization algorithm a for 
SGS /a even without optimizing MAX-FLIPS. 

5.5 Comparison to Clique Tree Clustering 

The purpose of these experiments was to compare SGS 
with clique tree clustering [6], [7], [33], [8] as imple- 
mented in the state-of-the-art system FIugin. Clique tree 
clustering is an exact algorithm that consists of two 
phases, the clustering phase and the propagation phase. 
The clique tree propagation (CTP) phase operates on a 
clique tree, created from a BN by the clustering phase. 
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89,780 










-12.82 

- 
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0.433 

21.94 

7.166 

0.567 

91,795 

102,816 







-9.50 

32 

3 

0.432 

21.92 

7.129 

0.231 

15,029 

13,620 

0.337 

144,449 

104,328 




-9.31 

32 

4 

0.432 

21.91 

7.108 

0.157 

7,384 

7,025 

0.211 

55,954 

31,719 

0.200 

19,5384 

10,5862 

-9.25 

32 


TABLE 3 

Mixture models, with k = 1 to k = 4 components, generated using the expectation maximization (EM) algorithm. 



SGS/FS, Water BN, Average Time 


O Average Time Poly . {Average Time) 



Fig. 7. Varying the MAX-FLIPS restart parameter for 
SGS/FS on the BNs Muninl (top) and Water (bottom). 
Average computation time is displayed for each BN. Ap- 
proximations using 4-th order polynomials, based on the 
experimental data points, are also shown. 


Clique tree size upper bounds treewidth, a fundamental 
graph-theoretic parameter [46] of a BN. 

Table 5 presents experimental results for 10 distinct 
application BNs for both SGS and clique tree propaga- 
tion. In these experiments, where the SGS system was 
optimized using our technique discussed earlier in the 
article, we only account for on-line run time, not pre- 
processing time. For HUGIN, pre-processing amounts 
to compilation of a BN into a clique tree. For SGS, 
pre-processing amounts to optimizing the parameters 
controlling search. The results in Table 5 have for FlUGIN 
been averaged over 50 runs, for SGS over 1,000 runs. 

Total clique tree size and inference time, as computed 
by clique tree clustering, is an indication of BN hardness. 
Among these BNs, Muninl has the largest total clique 
tree size and the slowest execution time. SGS clearly 


BN 

SGS 

CTP 

Name 

m/n 

Init. 

MF 

Time (s) 

Size (k) 

Time (s) 

Mildew 

1.31 

FDP 

200 

0.257 

9,566 

3.88 

Muninl 

1.49 

FS 

30 

0.0254 

384,621 

1824 

Munin2 

1.24 

FS 

113 

1.98 

4,862 

2.75 

Munin3 

1.26 

- 

- 

- 

3,113 

0.744 

Munin4 

1.34 

- 

- 

- 

14,340 

3.19 

Pigs 

1.34 

FDP 

550 

3.41 

828 

15.7 

Pirl 

1.02 

BDP 

2 

0.0205 

11 

0.0197 

Pir2 

1.00 

BDP 

8 

0.0100 

8 

0.0367 

Pir3 

1.17 

BDP 

6 

0.00216 

5 

0.00096 

Water 

2.06 

FS 

9 

0.00837 

3,657 

0.772 


TABLE 5 

Comparison of performance of SGS and clique tree 
propagation (CTP) on different BN instances. 


outperforms CTP on Mildew, Muninl, Pigs, Pir2, and 
Water. CTP is clearly superior to SGS for Munin3 and 
Muninl, as SGS did not terminate within the allocated 
number of MAX-TRIES for these BNs. 

We now compare the performance of the novel ini- 
tialization algorithms proposed in this paper, BDP and 
FDP, with FS. A key difference is that BDP and FDP 
are structure-based, while FS is CPT-based. As a con- 
sequence, BDP and FDP can be expected to perform 
very well for BNs that are tree-structured or close to 
tree-structured, and not so well otherwise. FS, on the 
other hand, does not rely on structure. The experimen- 
tal results confirm these qualitative statements: Table 5 
illustrates how BDP and FDP are the best performers, 
relative to FS, for BNs with substantial tree structure, 
reflected in a small ratio m/n for a BN 3 = ( X , E, P), 
where n = \ X | and m = \ E \ . 

In summary, these experiments clearly show that 
our stochastic greedy search algorithm — specifically 
SGS/FS, SGS /BDP, and SGS /FDP — performs very 
well on non-trivial BNs. In particular, these experiments 
show competitive performance on BNs that are not triv- 
ial for clique tree clustering, a state-of-the-art algorithm 
for BN inference. 

6 Discussion and Future Work 

Stochastic local search (SLS) algorithms provide, for 
certain classes of applications and problem instances, an 
appealing trade-off of accuracy for speed and memory 
requirements. Specifically, SLS algorithms require little 
memory and are very fast for certain problem instances, 
but are incomplete and may produce sub-optimal results. 
By focusing on computing most probable explanations in 
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Bayesian networks, we have in this article presented al- 
gorithmic, analytical and experimental results regarding 
SLS initialization (where to start search?) and the SLS 
restart parameter (when to re-start search?). In particu- 
lar, we have discussed Viterbi-based initialization algo- 
rithms, an analytical framework for SLS analysis, and 
analysis of Stochastic Greedy Search (SGS) specifically, 
and finally SGS's competitive performance relative to 
clique tree clustering. By carefully and jointly optimizing 
the initialization algorithm and the restart parameter 
MAX-FLIPS for SGS, we improved for application BNs 
the search performance by several orders of magnitude 
compared to initialization uniformly at random with no 
restart (or MAX-FLIPS = oo). 

We now consider, for SGS, the optimization of the 
initialization algorithm a and the MAX-FLIPS parameter. 
This optimization — which is reflected in the progression 
of Section 5.2, Section 5.3, and Section 5.4 — is a heuristic 
pre-processing step. First, as discussed in Section 5.2, 
one typically optimizes the selection probabilities in 
the initialization portfolio I. This optimization is partly 
informed by the structure of the BN, such that the 
probability of picking FDP (say) from the initialization 
portfolio is set higher for a BN that is close to having a 
forward-tree structure. The probabilities are then grad- 
ually adjusted, as one sees the impact of the different 
initialization algorithms on the progress of the stochastic 
search process, until one is left with a homogenous ini- 
tialization portfolio SGS /a. After a has been identified, 
the emphasis typically shifts to the optimization of the 
MAX-FLIPS parameter, utilizing the mixture distribution 
properties of the run-length distributions identified in 
this article. Empirical aspects of this optimization are 
investigated in Section 5.3 and Section 5.4. 

Our results appear to create new research opportuni- 
ties in the area of adaptive and hybrid SLS algorithms. 
One opportunity concerns the development of new ini- 
tialization algorithms. For example, it would be of inter- 
est to investigate more fine-grained hybrid algorithms, 
where different initialization algorithms may be applied 
to create different sub-explanations. Another question is 
whether SGS performance can be further improved by 
automatically optimizing the input parameters. Finally, 
additional work on finite mixture models appears useful. 
Most current approaches assume, as we have done here, 
homogenous mixtures. We speculate that heterogenous 
mixture models, where one distribution function F\ is 
(say) normal while another distribution function is 
(say) exponential, could provide improved fits for certain 
run-length distributions, and would be useful in other 
contexts as well. 
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